21 research outputs found

    An Investigation of Evaluation Metrics for Automated Medical Note Generation

    Full text link
    Recent studies on automatic note generation have shown that doctors can save significant amounts of time when using automatic clinical note generation (Knoll et al., 2022). Summarization models have been used for this task to generate clinical notes as summaries of doctor-patient conversations (Krishna et al., 2021; Cai et al., 2022). However, assessing which model would best serve clinicians in their daily practice is still a challenging task due to the large set of possible correct summaries and the potential limitations of automatic evaluation metrics. In this paper, we study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations. In particular, we propose new task-specific metrics and compare them to state-of-the-art (SOTA) evaluation metrics in text summarization and generation, including: (i) knowledge-graph embedding-based metrics, (ii) customized model-based metrics, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble metrics. To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts and computing the factual correctness, as well as the hallucination and omission rates for critical medical facts. This study relied on seven datasets manually annotated by domain experts. Our experiments show that automatic evaluation metrics can behave substantially differently on different types of clinical notes datasets. However, the results highlight one stable subset of metrics as the most correlated with human judgments under a relevant aggregation of different evaluation criteria. Comment: Accepted to ACL Findings 2023.
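    The fact-based criteria above lend themselves to a compact illustration. Below is a minimal sketch (not the paper's implementation) of computing factual correctness, hallucination, and omission rates, assuming system and reference facts have already been extracted and normalized as strings:

```python
def fact_based_scores(system_facts: set[str], reference_facts: set[str]) -> dict[str, float]:
    """Compare facts extracted from a generated note against reference facts.

    Assumes facts are already normalized (lowercased, canonical form),
    so set intersection approximates a fact match.
    """
    matched = system_facts & reference_facts
    # Factual correctness: fraction of system facts supported by the reference.
    correctness = len(matched) / len(system_facts) if system_facts else 0.0
    # Hallucination rate: fraction of system facts absent from the reference.
    hallucination = 1.0 - correctness if system_facts else 0.0
    # Omission rate: fraction of reference facts missing from the system note.
    omission = len(reference_facts - system_facts) / len(reference_facts) if reference_facts else 0.0
    return {"correctness": correctness, "hallucination": hallucination, "omission": omission}

# Example: one critical fact is hallucinated, one is omitted.
print(fact_based_scores(
    {"metformin 500 mg", "type 2 diabetes", "penicillin allergy"},
    {"metformin 500 mg", "type 2 diabetes", "hypertension"},
))
```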

    ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

    Full text link
    Recent immense breakthroughs in generative models such as GPT-4 have precipitated re-imagined, ubiquitous usage of these models across applications. One area that can benefit from improvements in artificial intelligence (AI) is healthcare. The note generation task from doctor-patient encounters, and its associated electronic medical record documentation, is one of the most arduous and time-consuming tasks for physicians. It is also a natural prime beneficiary of advances in generative models. However, with such advances, benchmarking is more critical than ever. Whether studying model weaknesses or developing new evaluation metrics, shared open datasets are an imperative part of understanding the current state of the art. Unfortunately, because clinic encounter conversations are not routinely recorded and are difficult to share ethically due to patient confidentiality, there are no sufficiently large clinic dialogue-note datasets to benchmark this task. Here we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus, the largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performances of several common state-of-the-art approaches.
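    A benchmarking loop over such a dialogue-note corpus is simple to set up. The sketch below uses the rouge-score package to score an arbitrary note generator; the pair layout is illustrative, not the actual ACI-BENCH schema:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def benchmark(pairs, generate_note):
    """Score a note-generation system on (dialogue, reference_note) pairs.

    `generate_note` is any callable mapping a visit dialogue to a note;
    ROUGE-L F1 is one common headline metric for this task.
    """
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(reference, generate_note(dialogue))["rougeL"].fmeasure
        for dialogue, reference in pairs
    ]
    return sum(scores) / len(scores)

# Trivial baseline: echo the dialogue back as the "note".
pairs = [("Doctor: Any chest pain? Patient: No.", "Patient denies chest pain.")]
print(benchmark(pairs, generate_note=lambda dialogue: dialogue))
```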

    Explicit and Implicit Semantic Ranking Framework

    Full text link
    The core challenge in numerous real-world applications is to match an inquiry to the best document from a mutable and finite set of candidates. Existing industry solutions, especially latency-constrained services, often rely on similarity algorithms that sacrifice quality for speed. In this paper we introduce a generic semantic learning-to-rank framework, Self-training Semantic Cross-attention Ranking (sRank). This transformer-based framework uses a linear pairwise loss with mutable training batch sizes, achieves quality gains with high efficiency, and has been applied effectively to two industry tasks at Microsoft over real-world large-scale datasets: Smart Reply (SR) and Ambient Clinical Intelligence (ACI). In Smart Reply, sRank assists live customers with technical support by selecting the best reply from predefined solutions based on consumer and support-agent messages. It achieves an 11.7% gain in offline top-one accuracy on the SR task over the previous system and has enabled a 38.7% reduction in message-composition time in telemetry recorded since its general release in January 2021. In the ACI task, sRank selects relevant historical physician templates that serve as guidance for a text summarization model to generate higher-quality medical notes. It achieves a 35.5% top-one accuracy gain, along with a 46% relative ROUGE-L gain in generated medical notes.
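    The linear pairwise loss at the core of this kind of learning-to-rank setup can be sketched compactly. The hinge-style formulation below in PyTorch is an assumption for illustration; the paper's exact loss may differ:

```python
import torch

def pairwise_hinge_loss(pos_scores: torch.Tensor,
                        neg_scores: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    """Linear (hinge) pairwise loss: penalize any relevant candidate that
    does not outscore an irrelevant candidate by at least `margin`.

    pos_scores: model scores for relevant (query, document) pairs.
    neg_scores: scores for irrelevant pairs from the same batch.
    """
    # Score difference over all positive/negative pairings in the batch.
    diff = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)
    return torch.clamp(margin - diff, min=0.0).mean()

# Example: two relevant and three irrelevant candidates for one query.
pos = torch.tensor([2.0, 1.5])
neg = torch.tensor([0.5, 1.8, 0.2])
print(pairwise_hinge_loss(pos, neg))  # non-zero: a negative outranks a positive
```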

    Information extraction from clinical and radiology notes for liver cancer staging

    No full text
    Thesis (Ph.D.)--University of Washington, 2016-06. Medical practice involves an astonishing amount of variation across individual clinicians, departments, and institutions. Adding to this condition, with the exponential pace of new discoveries in biomedical research, medical professionals, often understaffed and overworked, have little time and few resources to analyze or incorporate the latest research into clinical practice. The accelerated adoption of electronic medical records (EMRs) brings great opportunities to mitigate these issues. In computable form, large volumes of medical information can now be stored and queried, so that optimization of treatments based on patient characteristics, institutional resources, and patient preferences may be data driven. Thus, instead of relying on the skill sets of patients' support networks and medical teams, patient outcomes can at least have some statistical guarantees. In this dissertation, we focused specifically on the task of hepatocellular carcinoma (HCC) liver cancer staging using natural language processing (NLP) techniques. Staging, or categorizing cancer patients by extent of disease, is important for normalizing over patient characteristics. Normalized stages can then be used to facilitate research in evidence-based medicine to optimize treatments and outcomes. NLP is necessary because, as with other clinical tasks, a majority of staging information is trapped in free-text clinical data. This thesis proposes an approach to liver cancer stage phenotype classification using a mixture of rule-based and machine learning techniques for text extraction. Included in this approach is a careful, layered design for annotation and classification. Each constituent part of our system was characterized by detailed quantitative and qualitative analysis. Two important modules in this thesis are a framework for normalizing text evidence related to specific conditions and an algorithm for tumor reference resolution. The overall results of our system revealed F1 performance of 0.55, 0.50, and 0.43 for AJCC, BCLC, and CLIP liver cancer stages, respectively. Although these outperform baseline classifications, such accuracies are not viable for clinical use. Error analysis suggests that performance for some constituent stage parameters would improve with additional annotation. However, one identified crippling bottleneck was the requirement of reference resolution and discourse-level reasoning to determine the number of tumors in a patient, a crucial part of cancer staging. Still, our work provides a methodology for classifying a complex phenotype, whose strengths include interpretability and modularity while maintaining the ability to scale and improve with greater amounts of data. Furthermore, submodules of our system that perform at higher accuracies may be used as tools to decrease annotation costs.
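    The rule-based half of such a hybrid extraction pipeline can be illustrated with a toy normalizer for tumor-size evidence (illustrative only, not the dissertation's actual system):

```python
import re

# Matches sizes like "3.2 cm", "15 mm", "4cm" in radiology text.
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm)\b", re.IGNORECASE)

def extract_tumor_sizes_cm(note_text: str) -> list[float]:
    """Extract tumor-size mentions and normalize them to centimeters.

    Normalized sizes feed staging rules (e.g., size thresholds in staging
    criteria); resolving *which* tumor each mention refers to is the much
    harder reference-resolution problem identified as a bottleneck above.
    """
    sizes = []
    for value, unit in SIZE_PATTERN.findall(note_text):
        size = float(value)
        sizes.append(size / 10.0 if unit.lower() == "mm" else size)
    return sizes

print(extract_tumor_sizes_cm(
    "Segment VII lesion measuring 3.2 cm; satellite nodule of 15 mm."
))  # [3.2, 1.5]
```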

    Overview of the ImageCLEF 2023: Multimedia Retrieval in Medical, Social Media and Internet Applications

    No full text
    This paper presents an overview of the ImageCLEF 2023 lab, organized in the framework of the Conference and Labs of the Evaluation Forum (CLEF Labs 2023). ImageCLEF is an ongoing evaluation event, started in 2003, that encourages the evaluation of technologies for annotation, indexing, and retrieval of multimodal data, with the goal of providing information access to large collections of data in various usage scenarios and domains. In 2023, the 21st edition of ImageCLEF ran three main tasks: (i) a medical task, which included the sequel of the caption analysis task and three new tasks, namely GANs for medical images, Visual Question Answering for colonoscopy images, and medical dialogue summarization; (ii) a sequel of the fusion task, addressing the design of late fusion schemes for boosting performance, with two real-world applications: image search diversification (retrieval) and prediction of visual interestingness (regression); and (iii) a sequel of the social-media-aware task on awareness of the potential real-life effects of online image sharing. The benchmark campaign was a real success, receiving the participation of over 45 groups submitting more than 240 runs.
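    For the fusion task, late fusion means combining the outputs of independently trained systems after the fact. Reciprocal rank fusion is one standard scheme a participant might apply; the sketch below is generic, not a method prescribed by the lab:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Late fusion of several ranked result lists into one.

    Each item scores 1 / (k + rank) in every list that retrieves it;
    k dampens the influence of any single system's top ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two retrieval systems disagree on ordering; fusion merges their evidence.
print(reciprocal_rank_fusion([["img3", "img1", "img7"],
                              ["img1", "img7", "img3"]]))
```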
